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Abstract 

The deep sequencingof 1 6S rRNA genes amplified by universal primers has revolutionized our understand- 
ing of microbial communities by allowing the characterization of the diversity of the uncultured majority. 
However, some universal primers also amplify eukaryotic rRNA genes, leading to a decrease in the efficiency 
of sequencingof prokaryotic 1 6S rRNA genes with possible mischaracterization of the diversity in the micro- 
bial community. In this study, we compared 1 6S rRNA gene sequences from genome-sequenced strains and 
identified candidates for non-degenerate universal primers that could be used for the amplification of pro- 
karyotic 1 6S rRNA genes. The 50 identified candidates were investigated to calculate their coverage for pro- 
karyotic and eukaryotic rRNA genes, including those from uncultured taxa and eukaryotic organelles, and a 
novel universal primer set, 342F-806R, covering many prokaryotic, but not eukaryotic, rRNA genes was iden- 
tified. This primer set was validated by the amplification of 1 6S rRNA genes from a soil metagenomic sample 
and subsequent pyrosequencing using the Roche 454 platform. The same sample was also used for pyrose- 
quencing of the amplicons by employing a commonly used primer set, 338F-533R, and for shotgun metage- 
nomic sequencing using the lllumina platform. Our comparison of the taxonomic compositions inferred by 
the three sequencingexperiments indicated that the non-degenerate 342 F-806R primer set can characterize 
the taxonomic composition of the microbial community without substantial bias, and is highly expected to be 
applicable to the analysis of a wide variety of microbial communities. 
Keywords: 1 6S rRNA; primer design; non-degenerate primer; microbial community 



1 . Introduction 

Culture-independentand 1 6S rRNA gene-based ana- 
lysis is widely used to profile microbial communities.^ 

"I" These authors contributed equally to this work. 



Such profiling of 1 6S rRNA genes has revolutionized 
our understanding of microbial communities by allow- 
ing the characterization of the diversity of the uncul- 
tured majority.^ Moreover, ongoing development of 
massively parallel DNA sequencing technologies 
allows deep sequencing of a sample and simultaneous 
sequencing of many samples per run via low-cost 
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bar-coding techniques.^ Because of its advantages over 
the traditional Sanger sequencing, the massively paral- 
lel sequencing has been applied to large-scale sequen- 
cing projects, including the analysis of 1 6S rRNA gene 
amplicons from diverse microbial communities."^'^ 

The Roche 454 Genome Sequencer FLX system^ is 
commonly used for the massively parallel sequencing 
of the 1 6S rRNA gene amplicons.^ Several studies have 
suggested that short (450-600 nt) reads obtained by 
the 454 platform^ are sufficient for the taxonomic as- 
signment and the subsequent community comparison 
when the regions in 1 6S rRNA genes to be sequenced 
are carefully chosen.^ Notably, appropriate primer se- 
lection is a critical issue for analysing diverse microbial 
communities.^ The biasarisingfrom poorcomplemen- 
tarity between primers and 1 6S rRNA genes in particu- 
lar taxa leads to underestimation of these taxa in a 
microbial community.' ' Although the primer sets 
specific to Bacteria (e.g. the 338F and 533R set, which 
PCR-amplifies and covers the V3 regions of 1 6S rRNA 
genes)' ^ have generally been used for community pro- 
filing,^'' ^ recent studies of /4rc/?flec/ have revealed its ubi- 
quitous distribution in diverse environments and its 
important roles in various ecosystems (e.g. ammonia 
oxidization in soil and seawater).'^ To investigate the 
relative composition of Bacteria and Arciiaea and their 
ecological interaction in a microbial community,' uni- 
versal primer sets that can amplify both bacterial and 
archaeal 1 6S rRNA genes are needed.'^ Although 
many sets of universal primers have been designed for 
this purpose,^'' °'' ^~' ^ some such primers have led to 
the amplification of eukaryotic 1 8S rRNA genes and 
small-subunit rRNA genes from mitochondria and 
chloroplasts.'°''^ Moreover, it is often difficult to esti- 
mate the number and diversity of the eukaryotic rRNA 
genes present in the environment before conducting 
the sequence-based community analysis.' ^ Therefore, 
the use of primers that amplify rRNA genes of eukaryot- 
ic origins has led to mischaracterization of the diversity 
of microbial communities that were associated with eu- 
karyotic cells.' ^'^° This unfavourable situation is very 
serious in the case of host-associated habitats (e.g. the 
human skin, oral cavity, and vagina).^' 

The aim of this study istoovercome the problems asso- 
ciated with primerchoice, and we attempted to identify 
the novel universal primers that complement only con- 
served bacterial and archaeal 1 6S rRNA gene sequences 
so that eukaryotic and organelle rRNA gene sequences 
would not be amplified. In general, the commonly 
used degenerate primers do not always achieve a 
better taxonomic coverage than the non-degenerate 
primers,^^ and other recently designed degenerate 
primers still introduce a level of bias into the community 
profile.^^ Therefore, our present study was focused on 
only the non-degenerate universal primers. The primer 
set identified in this study was experimentally used to 
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show the taxonomic composition of a soil microbial 
community by amplifying 1 6SrRNA genes from the mi- 
crobial metagenomic sample and subsequent pyrose- 
quencing using the Roche 454 system. To evaluate the 
appropriateness of our original primer set to the taxo- 
nomic bias that might result from the PGR amplification, 
the same metagenomic sample was also used for (i) pyr- 
osequencingof the 1 6SrRNA gene fragments aftertheir 
amplification by use of the 338F and 533R set and (ii) 
the lllumina Genome Analyzer llx (GA llx) platform- 
based shotgun^"^ sequencing of the metagenome and 
subsequent finding and analysis of the 1 6S rRNA gene 
fragments, and the taxonomic compositions inferred 
by the three sequencing experiments were compared. 

2. Materials and methods 

2. / Identification of universal primer candidates from 
genome-sequenced strains 

To identify the candidate primersequencesforthePCR 
amplification of prokaryotic 1 6S rRNA genes, such genes 
from genome-sequenced strains were used as references 
because they have been accurately sequenced, are 
full-length genes, have well-defined taxonomic informa- 
tion. Bacterial and archaeal genomic sequences were 
obtained from the NCBI Genome Database (ftp://ftp. 
ncbi.nih.gov/genbank/genomes/Bacteria/, accessed on 
11 November 2013) in November 2008. The 1 6S 
rRNA gene sequences in each strain were identified by 
RNAmmer.^^ Then, one 1 6S rRNA gene sequence per 
species was randomly chosen because slight sequence 
differences exist among the 1 6S rRNA genes from 
strains of the same species,^^ and among the gene 
copies within a genome.^^ A total of 53 1 1 6S rRNA gene 
sequences were chosen. Their taxonomic information 
was obtained from the NCBI Taxonomy Database (http:// 
www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome. 
html/, accessed on 1 1 November 201 3). A multiple se- 
quence alignment of the 531 1 6S rRNA gene sequences 
was constructed using MAFFT version 6.71 3 with default 
parameters.^^ 

To f i nd out t he ca nd idate seq uences descri bed a bove, 
highly conserved regions identified in the reference 
alignment were chosen as follows. Generally, the 
primer lengths for the PCR-amplification of 1 6S rRNA 
genes are more than 1 5 nt;^ therefore, we used a 
sliding window of 1 5 nt with a step size of 1 nt across 
the reference alignment. For each window, we calcu- 
lated the frequency of each 1 5-nt sequence with one 
mismatch allowed. The 1 5-nt sequences that included 
gaps were also considered when calculating the fre- 
quencies. The consensus sequence for each window 
was defined as the 1 5-nt sequence that was found 
most frequently within one mismatch among strains. 
The coverage rate for a consensus sequence in each 
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phylum was defined as the percentage of matched 
sequences among genome-sequenced strains within 
one mismatch. 



2.2 Evaluation of each candidate universal sequence 
using the databases of the Ribosomal Database 
Project and the comparative RNA Web site 

Each candidate sequence was evaluated using the 
Probe Match program in the Ribosomal Database 
Project (RDP) (release 10, update 22; http://rdp.cme. 
msu.edu/probematch/search.jsp, accessed on 1 1 
November 2013).^'' The 615 916 1 6S rRNA gene 
sequences in the RDP were retrieved by using the para- 
meters Strain = Both, Source = Both, Size >1 200, and 
Quality = Good, and then these sequences were used 
as reference sequences of the Probe Match program for 
the evaluation of the candidate sequences. The coverage 
for a candidate sequence in each phylum of the reference 
sequences wascalculated asthe percentage of genera for 
which more than half of the membersequences in each 
genus were found to match the candidate sequence 
within one mismatch. Because the reference sequences 
contain many uncultured strains,^^ their taxonomic 
classifications are ambiguous, especially at the species 
level (e.g. description of the sp. as the taxa). Therefore, 
we used the genus, but not the species, to calculate the 
coverage for each candidate sequence. More than half 
of the 1 6S rRNA gene sequences in the reference 
sequences lacked their 5'- or 3'-terminal regions at the 
time of our analysis (http://rdp.cme.msu.edu/down 
load/releasel 0_2 2_containPosition.xls, accessed on 
1 1 November 2013). Therefore, the candidate se- 
q uences correspond ing to such regions were excluded 
from the subsequent analyses. 

The possibility of amplifying eukaryotic and organ- 
elle rRNA genes was evaluated by calculatingthe cover- 
age for each primer candidate and the eukaryotic 
and organelle rRNA genes that were obtained from 
the comparative RNA Web (CRW).^° Aligned rRNA 
sequences of chloroplasts, mitochondria, and eukar- 
yotes were separately downloaded from the CRW 
(http://www.rna.ccbb.utexas.edu/DAT/3A/Summary/ 
index.php, accessed on 11 November 2013). The 
sequences containing ambiguous nucleotides (e.g. 
any nucleotide, N, or pyrimidine, Y) were discarded. 
One 1 6S rRNA sequence per genus was randomly 
chosen to reduce taxonomic bias in the dataset. After 
the conversion of uracil residues to thymine ones in 
the sequences, the coverage for each candidate se- 
quence in the CRW-derived and subsequently 
processed sequences (76 chloroplastic,480 mitochon- 
drial, and 583 eukaryotic ones) was calculated as the 
percentage of matched sequences within two mis- 
matches to eliminate the possible amplification of 
such rRNA genes. Based on the results of prior 



studies,' ^ we assumed that three or more mismatches 
between a primer and template would not allow the 
successful PGR amplification. The candidate sequences 
that covered more than 50% of the mitochondrial and 
eukaryotic rRNA gene sequences were also excluded 
to avoid the amplification of such genes. To reduce 
the possibility of random and non-specific amplifica- 
tion of non-rRNA genes, we merged neighbouring can- 
didate sequences to extend their sequence lengths 
while maintaining the coverage for each phylum. 

The validity of our primers was compared with the 
validity of 1 7 other universal primers published in the 
literature (Supplementary TableSl ).^'' We inves- 
tigated the coverage of our and the 1 7 other primer 
sequences for (i) the 1 6S rRNA genes from bacterial 
and archaeal phyla in the reference sequences and (ii) 
the GRW-derived chloroplastic, mitochondrial, and eu- 
karyotic rRNA genes described above. To avoid the diffi- 
culties of /n silico evaluation, all ambiguous nucleotides 
in the 1 7 primers were converted to the nucleotides 
that gave rise to the best coverage for the reference 
sequences of bacterial and archaeal origins. 

2.3 Sample processing and sequencing of a soil microbial 
community 

A brown forest soil sample was collected in April 2008 
from farmland at the Ehime Research Institute of 
Agriculture, Forestry, and Fisheries, Matsuyama, Japan,^^ 
sieved through a 2-mm mesh to remove larger particles, 
and transferred to a sterilized glass pot with a loose lid. 
After its water content was adjusted to 60% of the 
maximum water-holding capacity, the soil was incubated 
at 28°Gfor 24 weeks. Total DNA was thereafter extracted 
from 1 0 g of soil using a PowerMaxSoil DNA Isolation kit 
(MoBio, Garlsbad, GA, USA) according to the manufac- 
turer's instructions. The DNA sample was concentrated 
by ethanol precipitation to obtain sufficient amount of 
DNAforPGRand metagenomic sequencing. PGR amplifi- 
cation of 1 6S rRNA genes was performed in 50 |jlI of 1 x 
ExTaq buffer(2 mMMg^+ Plus;Takara Bio, Ohtsu, Japan) 
containing 0.2 mM dNTPs, 0.625 \J of TaKaRa ExTaq HS 
(Takara Bio), 0.2 |jlM each of the forward and reverse 
primers, 2% dimethylsulphoxide, 0.01% bovine serum 
albumin, and 30 ng template DNA. Two PGR primer 
sets, 342F-806R and 338F-533R,'^ were used. The 
PGR programme consisted of a single cycle of 95°G for 
30 s, followed by 25 cycles of 95°G for 30 s, 55°G for 
30 s, and 72°Gfor 1 5 s. Afinal extension was performed 
at 72°Gfor 7 min. PGR ampliconsfrom eight parallel reac- 
tions were individually concentrated and purified by gel 
electrophoresis, and subsequently extracted to obtain 
more than 2 |xg of amplicons. Pooled amplicons were 
sequenced on a 454 GS FIX Titanium one-sixteenth pico- 
titer plate according to the manufacturer's protocols. In 
addition, l-|jLg portions of metagenomic DNA were 
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sheared to approximately 200-bp fragments using a 
Covaris-S instrument (Covaris, Woburn, MY, USA), and 
adapter sequences were ligated to both ends of the 
DNA fragments to generate a paired-end library. The 
library was subjected to 76 cycles of paired-end sequen- 
cing with an lllumina GA llx instrument (one lane of an 
eight-lane flow cell) using reagents of an lllumina 
Sequencing kit, version 3.0, followed by base-calling 
using GA pipeline, version 1 .5.0. 

2.4 Sequence analysis 

We used the high-quality 454 platform reads after 
removal of the reads that (i) contained ambiguous 
nucleotides, (ii) contained <150 or >250 nt in the 
338F-533R experiment and <350 or >550 nt in the 
342F-806R experiment, and (iii) were associated with 
an average Phred-like quality score of less than 20 as cal- 
culated by the 454 sequencer. Both the forward and 
reverse primer sequences were removed using cross- 
match, version 1.09051 8 (-minmatch 10 -minscore 
1 2) (http://www.phrap.org/phredphrapconsed.html, 
accessed on 11 November 2013). Reverse-comple- 
mented reads were converted so that all reads started 
according to standard-sense strand conventions. 
Chimera sequences were detected and removed by 
the ChimeraSlayer in mothur version 1.11,^^ using 
the Silva reference alignment (http://wvw.mothur.org/ 
w/images/f/fl /Silva.gold.bacteria.zip, accessed on 1 1 
November 2013). We also discarded the lllumina 
reads that (i) were evaluated as quality 'N' by purity 
filtering using the GA pipeline version 1 .5.0 and (ii) con- 
tained more than one nucleotide with the quality-flag 
'B' in the first 60 nt. 

From the 615 916 1 6S rRNA gene sequences that 
were used as the reference sequences (see above), we 
eliminated (i) the 1 1 7 059 sequences whose origins 
at the genus-level were designated as 'unclassified,' 
and (ii) forty 23S rRNA gene-derived sequences with 
lengths longer than 2000 nt. Six sequences (S0013 
37843, S001337931, S001337937, S00133801 5, 
SOOl 3381 73, and SOOl 3381 77) that contained un- 
usually long homopolymers were also discarded 
because such homopolymers have returned many hits 
in our sequence similarity searches of the lllumina 
reads. Consequently, we obtained a database consisting 
of a total of 498 81 1 high-quality 1 6S rRNA gene 
sequences whose origins at the phylum and genus 
are definite, and we designated this database the 
'in-house high-quality RDP' database. This database 
for Bacteria and Arcliaea was merged with the 
CRW-derived and above-described rRNA gene se- 
quence database for chloroplasts, mitochondria, and 
eukaryotes. 

Taxonomic assignment of the high-quality reads that 
were obtained from the 454 amplicon sequencing by 



use of the 338F-533R and 342F-806R primer sets 
was performed by a BLASTN^"^ search [e-value 
<le-8; Identity >94% (genus) or >85% (phylum); 
and Alignment length >100 (for 338F-533R) or 
>300 (for 342F-806R)] against our merged database 
described above. The thresholds for alignment length 
were necessary toeliminate inaccurate orfalse-positive 
alignments. Identification and taxonomic assignment 
of 1 6S or 1 8S rRNA gene fragments in the reads 
obtained by the lllumina metagenomic sequencing 
was performed by a BLASTN search [e-value <le-5; 
Identity >94% (genus) or >85% (phylum); and 
Alignment length >50] of our merged database. 

When a best hit with the largest BLAST bit score existed, 
one was added tothetaxon ofthe best hit forcalculation 
of the taxonomic composition ofthe microbial commu- 
nity. When two or more database sequences had the 
same best BLAST bit score, one divided by the number 
of best hits was added to the corresponding taxa ofthe 
best hits. To improve the accuracy of the taxonomic 
assignments of the lllumina reads, we used only the 
read pairs whose sequences (i) were identified as those 
in a 1 6S rRNA gene from the same taxon and (ii) had a 
consistent alignment direction between the paired 
reads. The distance between the paired reads was not 
considered because the 1 6S rRNA genes in some taxa 
contain long insertions.^^ 

2.5 Nucleotide sequence accession numbers 

Sequences have been submitted to the DDBJ 
Sequence Read Archive under the following accession 
numbers: DRROOl 479 and DRROOl 475 for the 454- 
read data obtained using the 338F-533R and 342F- 
806R primer sets, respectively, and DRROOl 464 for 
the lllumina read data by metagenomic sequencing. 

3. Results and discussion 

3 . 1 Identification of universal primer candidates for 1 6S 
rRNA genes from genome-sequenced strains 
There is a large collection of 1 6S rRNA gene se- 
quences in the databases,^^'^^'^° and almost all such 
sequences were obtained by use of PCR. Therefore, the 
sequences that failed to match the universal primers 
do not exist in the database.^^ Furthermore, many 
sequences in the database still contain the primer 
sequences used for PCR.^'' On the other hand, the 1 6S 
rRNA gene sequences from the genome-sequenced pro- 
karyotic strains contain the complete 1 6S rRNA gene 
sequences, including both terminal regions, which 
are missing in many cases in the databases (e.g. see 
http://rdp.cme.msu.edu/download/releasel 0_2 2_ 
containPosition.xls, accessed on 1 1 November 201 3). 
Therefore, the 1 6S rRNA gene sequences from the 
genome-sequenced strains are expected to provide 
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Figure 1 . A partial plot of the consensus sequence in each window and its coverage rate for each phylum from 531 1 6S rRNA genes of genome- 
sequenced strains. The number of species in each phylum is indicated in parentheses. The start position of each window is represented 
according to the corresponding position in the 1 6S rRNA gene from Escherichia coli. The consensus sequence in each window is 
represented at the bottom of the figure. Each line indicate the coverage rate for the consensus sequence of each phylum by using the 
colours of dots: black <50%, blue <80%, green <90%, yellow <1 00%, and red = 1 00%. The bar graph at the top of the figure indicates 
the sequence variability of each window that is depicted by (i) calculating the relative entropy of four nucleotides and the gap (-) at 
each site^^ and (ii) summing up the relative entropy of each site in the window. 



US accurate information on the variabi I ity of 1 6S rRNA 
genes from many different taxa. 

Information concerning the consensus sequence for 
each 1 5-nt window and its coverage rate for 1 6S rRNA 
genes from the genome-sequenced strains of each 
phylum is shown in Figure 1 (some of the windows) 
and Supplementary Figure SI (all windows). The 
windows in which almost all of the dots for the archaeal 
(top four) and bacterial lines showed >90% sequence 
identities (coloured red or yellow) were the universal 
primer candidate sequences. The candidates with con- 
served sequences for almost all bacterial and archaeal 
phyla were identified, and 50 such sequences are listed 
in Supplementary Table S2. Although the list was 
prepared based on the genomic data available in 
November 2008, the use of the updated data in July 
2013 did not affect the conservation pattern of 50 
universal primer candidate sequences across phyla 



(SupplementaryTableS3).Our 50 universal primercan- 
didate sequences are well overlapped with the degener- 
ate primer candidates proposed by Wang and Qian 
(Supplementary Table S4),^ in spite that their algorism 
and database used were different from ours. Since the 
maximum sequencing read by the 2nd-generation 
sequencers at the time of our study was up to 450 nt, 
17 out of the 50 candidate universal primer sequences 
that are located at the 3'-terminal regions of 1 6S rRNA 
genes (e.g. sequences 1 390-1 395, 1 491-1496, and 
1 525-1 526, Supplementary Fig. SI) were excluded 
for further analysis. Because the latest systems allow 
the determination of sequences up to 1 1 00 nt in 
length (e.g. the PacBio sequencer^ and 454 GS FLX+ 
system; http://454.com/products/gs-flx-system/index. 
asp, accessed on 1 1 November 2013), the excluded 
sequences might be useful as primer candidates in the 
near feature. 
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3.2 Coverage for our candidate primer sequences and 
other universal primer sequences in bacterial, 
archaeal, eukaryotic, and organelle rRNA genes 

We calculated the coverage for each of the remaining 
33 candidates for rRNA genes from each bacterial and 
archaeal phylum in our in-house high-quality RDP 
database and from chloroplasts, mitochondria, and 
eukaryotes in the CRW-derived and above-described 
database, and the results are depicted as a heat map 
(Supplementary Fig. S2). All of the candidate sequences 
covered most of the chloroplastic rRNA genes within 
two mismatches, indicating the difficulty in avoiding 
their amplification during microbial community profil- 
ing. Twenty-one out of the 33 primer candidates 
[514-522 (nine candidates), 781 -788 (seven candi- 
dates), and 907-91 5 (five candidates)] had relatively 
large coverage in bacterial and archaeal 1 6S rRNA 
genes. However, these 21 candidate sequences also 
covered the mitochondrial and eukaryotic rRNA genes 
within two mismatches; therefore, these sequences 
were not further considered as candidates. In cases in 
which the targeted microbial community contains only 
a small fraction of Eukaryota, these primers might be a 
good choice for the community profiling. 

The remaining 1 2 candidates [341 -343 (three can- 
didates), 789-792 (four candidates), 879-880 (two 
candidates), and 1 060-1 066 (three candidates)] had 
sequences that appeared to exclude possible amplifica- 
tion of mitochondrial and eukaryotic rRNA genes. 
Because (i) the maximum length of a high-quality read 
for the 454 GS FLX Titanium system was ~450 nt and 
(ii) the longer sequences provide more accurate taxo- 
nomic assignments,^ we chose candidates 341-343 
and 789-792 as forward and reverse universal primer 
candidates, respectively. After merging the sequences 
of the neighbouring candidates, but maintaining the 
coverage in each phylum, we finally chose the sequence 
from 342 to 3 57 (5'-CTACGGGGGGCAGCAG-3'; 1 6 nt: 
342 F) and that from 790 to 806 (5'-GGACTAC 
CGGGGTATCT-3'; 1 7 nt: 806R) as potential forward 
and reverse universal primers, respectively. 

The coverage for our universal primer set and the 1 7 
other universal primers (Supplementary Table SI) in 
each of the bacterial and archaeal phyla rRNA genes, 
and the chloroplastic, mitochondrial, and eukaryotic 
rRNAgenes is represented asa heat map (Fig. 2). The pre- 
viously designed primers,^'' 519F-534R (four 
primers), 784F, 906F, and 9 2 6R, covered mitochondrial 
and/or eukaryotic rRNA genes within two mismatches. 
The coverage for our universal primer set was, when 
compared with those for the 1 7 other universal 
primers, relatively large for bacterial and archaeal rRNA 
genes; however, relatively poor coverage was observed 
for some phyla (e.g. Crenarchaeota, Planctomycetes, 
ODl, and Chlamydiae). Furthermore, our universal 



primers had four mismatches (on average) against mito- 
chondrial and eukaryotic rRNA genes. Most of the 1 9 
primers covered chloroplastic rRNA genes within two 
mismatches (Supplementary Table S5). This result and 
the data in Supplementary Fig.S2 indicate that all of uni- 
versal primersequences examined inthis study would be 
difficult to avoid the amplifications of chloroplastic rRNA 
genes. Findinga universal primerthat avoidsthe amplifi- 
cation of chloroplastic rRNAgenes is an important issue 
to be carried out in the future. 

In general, the broad-range primers targeting bacter- 
ial and archaeal 1 6S rRNA genes tended to also amplify 
eukaryotic rRNA genes.' When researchers 
analyse the microbial communities that contain many 
eukaryotic cells, the use of broad-range primers will 
lead to a decrease in the efficiency of sequencing and 
the mischaracterization of the microbial diversity in 
each community.'^ Such communities are exemplified 
by eukaryotic host-associated ones: for example, the 
human microbiome project revealed that the average 
ratios of human DNA in the total metagenomic DNA 
extracted from several human body sites were about 
50%.^' A broad-range primer, 51 5F, is used as a 
standard universal primerto analyse microbial commu- 
nities in several environments in the Earth Microbiome 
Project (http://www.earthmicrobiome.org/, accessed 
on 1 1 November 201 3). Since the 51 5 F primer perfect- 
ly matches to many eukaryotic, mitochondrial, and 
chloroplastic rRNA genes, some studies that analysed 
animal-associated environments reported that the use 
of this primer also led to the amplification of many 
eukaryotic, mitochondrial, and chloroplastic rRNA 
genes.^^ We primarily aimed to identify novel non- 
degenerate universal primers that were expected to 
amplify most bacterial and archaeal rRNA genes, but 
not eukaryotic rRNA genes. Because of this aim, our 
novel universal primers could not easily amplify the 
1 6S rRNA genes from some bacterial and archaeal taxa 
without bias. Examples of such underrepresented taxa 
are the archaeal phyla, Crenarchaeota, Korarchaeota, 
and Nanoarchaeota, and the bacterial phyla, Chlamydiae, 
Lentisphaerae, ODl, OPll, Planctomycetes, and SRI 
(Fig. 2). If these taxa are the major members in a micro- 
bial community of an environment, use of the 342F- 
806R primer set for the amplicon sequencing analysis 
will give rise to significantly biased taxonomic compos- 
ition. To avoid misunderstanding of the compositions, 
the prior information related to the taxonomic com- 
position of the community is very important. For 
example, Crenarchaeota is a major archaeal phylum;"^" 
thus, in the analysis of the microbial community rich 
in this phylum, the combined use of our universal 
primer set with an established /4rc/?flefl-specific univer- 
sal primer set' °'' ^ will allow more accurate description 
of the composition of the community. Since both of the 
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primer sets can amplify 1 6S rRNA genes from 
Euryarchaeota, the relative abundance of this phylum 
in the community can be inferred, allowing us to cali- 
brate the relative abundance of Crenarchaeota to that 
of Bacteria in the community. 

3 . 3 Comparison of soil microbial community profiling by 
pyrosequencing witli thetwo sets of primer pairs and 
by lllumina metagenomic sequencing 
Experimental validation of our universal primers was 
conducted by 1 6S rRNA gene-based profilingof asoil mi- 
crobial community with the 454 pyrosequencing. To 
assess whether our primer set enabled us to amplify 
most bacterial taxa in the soil sample in comparison 
with other primer sets, a commonly used bacteria- 
specific primer set, 338F-533R,^^ was also used forthe 
454 pyrosequencing. The use of 1 6S rRNA gene frag- 
ments obtained by shotgun metagenomic sequencing 
can eliminate any bias that might be associated with 



the initial PCR amplification of 1 6S rRNA genes for the 
pyrosequencing."^^ Therefore, metagenomicsequencing 
was additionally performed using the lllumina platform 
with paired-end libra ries of thesame sample to examine 
whether use of our universal primers resulted in taxo- 
nomic bias. Our sequencing and subsequent filtering 
generated 1 5 165 (338F-533R) and 27 527 (342F- 
806R) high-quality 454 reads by 1 6S rRNA gene ampli- 
con sequencing, and 1 2 825 888 high-quality lllumina 
reads (6 412 944 read pairs) by metagenomicsequen- 
cing. A total of 302 6 lllumina reads were assigned to 
1 6S rRNA genes. 

Our use of the two primer sets generated no 454 
readsthat were assigned to mitochondrial oreukaryotic 
rRNA genes from our refined CRW database (see 
Materials and methods), although eight lllumina read 
pairs were assigned to 1 8S rRNA genes from some 
fungi. Therefore, our universal primers did not allow 
the detection of the amplification of mitochondrial or 
eukaryotic rRNA genes. 
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Figure 2. The coverage for our universal primer set (red) and the 1 7 other universal primers in each of the bacterial and archaeal phyla rRNA 
genes, and thechloroplastic, mitochondrial, and eukaryotic rRNA genes. Each column indicates the coverage for a candidate sequence from 
each of the bacterial and archaeal phyla rRNA genes, and the chloroplastic, mitochondrial, and eukaryotic rRNA genes by colour intensity: 
blue ^ low coverage, and pale orange ^ high coverage. See Supplementary Table SI for each sequence. The number of genera is provided 
in parentheses. 
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Figure 3. Phylum-level taxonomic compositions of prokaryotes in the soil metagenome on the basis of the 338F-533R and 342F-805R 
amplicon pyrosequencing and the 1 6S rRNA gene fragments from lllumina metagenomic sequencing. The top 1 8 abundant phyla from 
the results of lllumina metagenomic sequencing are listed. 



Figure 3 shows the results of a BLASTN search for 
phylum-level taxonomic assignment of each 454 and 
lllumina read against our merged in-house high-quality 
RDP and CRW sequence database. The numbers of taxo- 
nomically assigned reads at the phylum level were 
1 3 591 (89.62% of the total high-quality 454 reads for 
the 338F-533R experiment), 26 738(97.13% of the 
total high-quality 454 reads for the 342F-806R experi- 
ment), and 1513 for the lllumina read pairs (3026 
reads) (Supplementary Table S6). Among the three 
types of experiments, the taxonomic compositions at 
the phylum level exhibited good Spearman'scorrelations 
(P-value < 0.05): illumina versus 338F-533R, r = 0.75; 
lllumina versus 342F-806R, r= 0.81 ; and 338F-533R 
versus 342F-806R, r= 0.86. These three experiments 
identified Proteobacteria and Acidobacteria as the major 
phyla in the soil microbial community. The composition- 
al bias for each phylum among the three experiments 
was evaluated by investigating the rank abundance of 
each phylum in each experiment (Fig. 4). In comparison 
with the lllumina metagenomic sequencing data, the 
454 amplicon sequencing using the 338F-533R and 
342F-806R primer sets gave rise to data in which the 
phyla Verrucomicrobia, Planctomycetes, Crenarchaeota 
and Euryarchaeaota, and the phyla Planctomycetes, 
Crenarchaeota, ODl , and Chlamydiae, respectively, were 
underrepresented. Such underrepresentation might 
have occurred because most 1 6S rRNA genes from the 
phyla had sequences with more than two mismatches 
with the 338F and 342F primer sequences. The use of 
the 33 8F-533R primer set generated no 454 reads 
that were assigned to Euryarchaeota, being consistent 
with the specific amplification of only bacterial 1 6S 



rRNA genes by this primer set. Conversely, eight 454 
reads by the 342F-806R experiment and ten read pairs 
by the lllumina metagenomic sequencing were assigned 
tothisphylum,indicatingthat 342 F-806Rcan simultan- 
eously amplify bacterial and archaeal 1 6S rRNA genes. 

The genus-level taxonomic assignment for the 454 
and lllumina reads is depicted in Supplementary Figure 
S3. The numbers of reads assigned at the genus level 
were 9956 (65.65% of the total high-quality 454 reads 
for the 338F-533R experiment), 20 431(74.22% of 
the total high-quality 454 reads for the 342 F-806R ex- 
periment), and 1041 (lllumina read pairs) (Supple- 
mentary Table S7). The top 20 abundant genera identi- 
fied in each experiment accounted for more than 70% 
of the taxonomically assigned reads in all three experi- 
ments. Most of the top 20 most-abundant genera in 
each experiment (Supplementary Fig. S4) were also cate- 
gorized as those inthetwootherexperiments,and atotal 
of the top 26 most-abundant genera were used forcom- 
parative analysis. Compared with the lllumina metage- 
nomic sequencing data, the 454 amplicon sequencing 
data led to an underrepresentation of three genera 
(the genera Subdivisions and Spartobacteria in the 
Verrucomicrobia phylum, and the genus Zavarzinella 
in the Planctomycetes phylum) and two genera 
{Zavarzinella and ODl in the ODl phylum) in the 
338F-533R and 342F-806R experiments, respectively. 
Underrepresentation might have also occurred because 
the 1 6S rRNA genes from these genera had sequences 
with more than two mismatches for the 338F or 342F 
primer. The 1 6S rRNA genes in certain genera (e.g. 
Conexibacter and Zavarzinella) were underrepresented 
in the 342F-806Ror 338F-533R data when compared 
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Figure 4. Comparison of three experiments for ranl< abundances of phyla. Large circles indicate the rank abundances of each phylum in the 
three experiments, starting at the rank abundance 1, followed by 6, 11, 1 6, 21, and 26. Phyla are listed in a clockwise direction 
according to the rank abundance in the lllumina experiment. 



with the lllumina metagenomic sequencing data, and 
their underrepresentation could not be explained by 
the primer m ismatches. The formation of (a) specific sec- 
ondarystructure(s)ordifferences in the accuracies of the 
taxonomic assignments for the 454 and lllumina reads 
might have been responsible for the underrepresenta- 
tion. The Spearman's correlations calculated for the 
abundance of the 26 genera between each pairof experi- 
ments were: 0.37 (lllumina versus 338F-533R), 0.71 
(lllumina versus 342F-806R), and 0.60 (338F-533R 
versus 342F-806R). Comparison atthe genus-levels indi- 
cated that the taxonomic composition inferred by the 
338F-533R amplicon sequencing correlated weakly 
with that by the lllumina metagenomic sequencing. In 
contrast, the taxonomic composition by the 342F- 
806R amplicon sequencing correlated very well with 
that by the lllumina metagenomic sequencing, indicat- 
ing that our universal primers can amplify a broad 
range of taxa with a low bias. 

Since the lllumina sequencing of 1 6S rRNA gene frag- 
ments in the metagenome produced no chimera or 
bias that is associated with the initial PCR step of ampli- 
con sequencing, the taxonomic composition of the mi- 
crobial community based on this sequencing method is 
highly reliable.'*^ However, the inference of taxonomic 
composition in a microbial community obtained using 
the metagenomic 1 6S rRNA gene fragments requires 
ultra-deep sequencing since such fragments represent 
only a very small fraction of the metagenome."^^ 



Indeed, our use of the lllumina GA llx platform for meta- 
genome sequencing of the soil microbial community 
gave rise to only thousands of reads as 1 6S rRNA gene 
fragments from >1 0 million total reads. This observa- 
tion impliesthatthequantification of relative abundance 
in a microbial community will be reliable for major taxa 
but unsuitable for minortaxa. Furthermore,the metage- 
nomic 1 6S rRNA gene fragments corresponded random- 
ly to a number of different regions within the 1 6S rRNA 
genes, indicating the difficulty in conducting a detailed 
comparison of metagenomic 1 6S rRNA gene fragments 
(e.g. construction of a phylogenetic tree). To characterize 
the taxonomic composition of a microbial community 
including its minortaxa at low cost, the amplification of 
1 6S rRNA genes using appropriate universal primers is 
necessary. The taxonomic composition inferred by the 
454 amplicon sequencing with our non-degenerate 
universal primer set correlated well with that obtained 
by lllumina metagenomic sequencing, indicating that 
our universal primers can characterize microbial com- 
munities withoutsubstantial biasand with low possibility 
to amplify eukaryotic rRNA genes. Although we did not 
carry out additional experimental validation of our 
novel and non-degenerate universal primer set by using 
other environmental samples rich in eukaryotic and 
mitochondrial rRNA genes, the results in this study 
indicate that our primer set is highly expected to be 
applicable to the analysis of many diverse microbial 
communities. 
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